Abstract: The Load-Balanced Router architecture has received
a lot of attention because it does not require centralized
scheduling at the internal switch fabrics. In this paper we 
reexamine the architecture, motivated by its potential to turn 
off multiple components and thereby conserve energy in the 
presence of low traffic. 

We perform a detailed analysis of the queue and delay 
performance of a Load-Balanced Router under a simple random 
routing algorithm. We calculate probabilistic bounds for queue 
size and delay, and show that the probabilities drop exponen- 
tially with increasing queue size or delay. We also demonstrate 
a tradeoff in energy consumption against the queue and delay 
performance. 

I. Introduction 

The concept of a Load-Balanced Router was studied at
length at the beginning of the last decade; see, for example,
[6]. In this work we analyze various
performance aspects of a Load-Balanced Router, motivated 
by the potential energy saving enabled by this architecture. 

Energy efficiency in networking has recently attracted a 
large amount of attention. One of the main aims in much 
of this work is captured by the slogan energy-follows-load,
also known as energy proportionality. In other words, we
wish to make sure that the energy consumed by a networking 
device matches the amount of traffic that the device needs to 
carry. This is in contrast to more traditional architectures for 
which the device operates at full rate at all times even if it is 
lightly loaded. Indeed, by a conservative estimate in a study 
conducted by the Department of Energy in 2008, at least 
40% of the total consumption by network elements such as 
switches and routers can be saved if energy proportionality 
is achieved. This translates to a saving of 24 billion kWh per 
year attributed to data networking [1]. A recent study
further confirms that the power consumption of some state- 
of-the-art commercial routers stays within a small percentage 
of the peak power profile regardless of traffic fluctuation, for 
example the significant daily variation in traffic load [26].

Various approaches have been proposed in order to achieve 
energy-proportionality. Speed scaling, also known as rate 
adaptation, and powering down are two popular methods for 
effectively matching energy consumption to traffic load. The 
former refers to setting the processing speed of a network 
element according to traffic load. It is typically assumed that 
the energy consumption is superlinear with respect to the 
operating rate. The latter refers to turning off the element at 
certain times and so it either operates at the full rate or zero 
rate. Both methods are the subject of active research, though
most of the work focuses on optimizing an individual element
in isolation (see, e.g., [23], [25], [29]). A central question
for both methods is how to set the speed so as to minimize
energy usage while maintaining a desirable performance,
e.g. latency or throughput.

The Load-Balanced Router architecture has the potential 
to handle the traffic in such a way that portions of the device 
can be turned off in response to lightly loaded traffic. We 
provide what we believe is the first detailed analysis of queue 
size and delay in a Load-Balanced Router, and we provide a 
tradeoff between energy consumption and queue size/delay. 
In order to describe our results in more detail we now give a 
brief description of the Load-Balanced Router architecture. 

A. Motivation for Traditional Load-Balanced Architecture 

One of the most fundamental goals of any router architec- 
ture is to achieve stability, which is sometimes referred to as 
100% throughput. In other words the router aims to process 
all the arriving traffic so long as no input and no output 
are inherently overloaded. The key difficulty with doing this 
is that the arriving traffic may be highly non-uniform, i.e. 
if $\lambda_{ik}$ is the arrival rate for traffic going from input i to
output k, we will typically have $\lambda_{ik} \neq \lambda_{i'k'}$ for $ik \neq i'k'$.
Early work on switching considered a crossbar architecture 
in which matchings between the inputs and the outputs are 
set up at every time step. It was shown in [24] that the
Maximum Weight Matching algorithm (with weights equal to 
the backlog for each input-output pair) can ensure stability. 
Subsequent papers looked at simplifications of this scheduler 
that could still achieve stability. 

However, a major drawback of all these approaches is 
that they require a centralized scheduler with information 
about the backlogs of data on each input-output pair. A 
solution is the Load-Balanced Router that could make use 
of randomized routing ideas first proposed by Valiant [28].
In the Load-Balanced Router there is a middle stage placed 
between the input nodes and the output nodes. Each arriving 
packet is routed to a middle-stage node chosen at random. 
After passing through the middle stage each packet is then 
forwarded to its designated output. In order to realize this 
architecture we place a switching fabric between the input 
stage and middle stage and between the middle stage and 
the output stage. The beauty of this design is that the random 
routing ensures that for each of these switching fabrics no 
complicated scheduling is needed. All we need to do is repeat
a uniform schedule in which each connection is served at
least once [6].

B. Energy Consideration for Revisiting Load-Balanced Architecture

Our motivation for revisiting the Load-Balanced Router 
architecture is that it provides an attractive framework for 
studying energy proportionality. For example, the number
of active nodes in the middle stage can be reduced in the 
presence of light traffic, and increased with heavier traffic. 
The switch fabric between the input and middle stage and 
between the middle stage and the output can be functionally 
viewed as full meshes, as indicated in Figure 1. One possibility
is to implement the mesh with round-robin crossbars.
For example, Keslassy's thesis [22] assumes that the input,
output and the middle stage are all of size n and each
fabric is an n × n crossbar that operates in a time-slotted
fashion. At each time slot t the first fabric connects input
node i to middle node (i + t) mod n and the second fabric
connects middle node j to output node (j + t) mod n. In this
implementation, if the number of active nodes in the middle
stage is reduced then the crossbar can either slow down or be
turned off periodically.
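
The round-robin connection pattern described above can be sketched in a few lines of Python. This is only an illustration, under the assumption that at slot t the first fabric connects input i to middle node (i + t) mod n and the second fabric connects middle node j to output (j + t) mod n; the function names are hypothetical and are not taken from [22].

```python
def first_fabric(i: int, t: int, n: int) -> int:
    """Middle node reached from input i at time slot t (assumed pattern)."""
    return (i + t) % n


def second_fabric(j: int, t: int, n: int) -> int:
    """Output node reached from middle node j at time slot t (assumed pattern)."""
    return (j + t) % n


if __name__ == "__main__":
    n = 4
    for t in range(n):
        # Each row is a permutation: no two inputs share a middle node in a slot.
        print(t, [first_fabric(i, t, n) for i in range(n)])
```

Over any n consecutive slots every input is connected to every middle node exactly once, which is why no scheduling decision is needed.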

Adjusting the size of the middle stage or the speed of
the switching fabric poses considerable technological and
engineering challenges. In this note we do not aim to address
these issues. Our focus is on analyzing queue size and delay 
given the active portion of the middle stage, which leads to 
a tradeoff between power consumption and the size of the 
active middle stage. 

C. Model and Definition 

We formally define the Load-Balanced Router architecture 
as follows. The router has n inputs and n outputs. We 
normalize the line rate such that it equals 1 on each input and 
output link. We also have a middle stage lying in between the 
inputs and outputs that consists of m nodes. The traditional
Load-Balanced Router has m = n. However, here we treat m
as a separate parameter. In between the input and the middle
stage we have an n × m mesh. Effectively each link in this
mesh operates at rate $\alpha/m$ for a speedup of $\alpha \ge 1$. Similarly,
in between the middle and the output stage we have an m × n
mesh with link rate $\beta/m$ for a speedup factor of $\beta \ge 1$. Each
of the two meshes is input-buffered, i.e. each input node has
a separate buffer for each middle node and each middle node
has a separate buffer for each output node. See Figure 1.

From now on, we use i to index the input, j the middle 
stage and k the output. We refer to the packets that wish to 
go from input i to output k as ik packets. Similarly, link ij 
connects input i and node j in the middle stage and link jk 
connects node j in the middle stage and output k. Buffer ij 
(resp. jk) at node i (resp. j) buffers packets that are waiting 
to traverse link ij (resp. jk). 

The key to possible energy savings is that we assume that
not all nodes need to be active at periods of low load. We
suppose that at any time we can choose $m' \le m$ nodes to
be active in the middle stage and that such a configuration
requires power $w(m')$ for some function $w(\cdot)$.

Fig. 1. A 4 × 6 × 4 Load-Balanced Router architecture, where the number
of nodes in the middle stage can be different from the number of inputs and
outputs.

When an ik packet p arrives we choose a random middle
node j among the active nodes in the middle stage and place
p in buffer ij at node i.¹ The link ij operates continuously at
rate $\alpha/m$ and transmits packets in the ij buffer in a FIFO
manner. Similarly, when p arrives at node j in the middle
stage, it is placed in the buffer jk. The link jk continuously
operates at rate $\beta/m$ and transmits packets in the jk buffer
in a FIFO manner. As we can see, the scheduling for both
stages requires no centralized intelligence and is extremely
simple.
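
The routing and service rules above can be sketched as follows. This is a minimal time-slotted illustration, not the authors' implementation: the function names, the dictionary of buffers and the credit-based discretization of a link that "operates continuously" at a fractional rate are all assumptions made for the example.

```python
import random
from collections import deque


def route_packet(buffers, i, k, active_nodes, rng=random):
    """Place an ik packet in buffer ij for a uniformly random active node j."""
    j = rng.choice(active_nodes)
    buffers.setdefault((i, j), deque()).append(k)
    return j


def serve_link(buffer, credit, rate):
    """Serve one time slot of a FIFO link with the given rate (packets/slot).

    Returns the packets sent in this slot and the leftover service credit.
    """
    credit += rate
    sent = []
    while buffer and credit >= 1.0:
        sent.append(buffer.popleft())  # FIFO: oldest packet first
        credit -= 1.0
    if not buffer:
        credit = 0.0  # assumption: an idle link does not bank service for later
    return sent, credit
```

Here `rate` would be $\alpha/m$ for an ij link and $\beta/m$ for a jk link; neither function needs any information about other buffers, which reflects the absence of centralized scheduling.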

Lastly we describe our traffic model. As is common in 
work on scheduling in routers, we assume that packets are 
of unit size (or else are partitioned into cells of unit size). 
For any s, t let $A_{ik}(s,t)$ be the amount of ik traffic arriving
at the router in the time interval [s, t) and let
$A_i(s,t) = \sum_k A_{ik}(s,t)$ and $A_k(s,t) = \sum_i A_{ik}(s,t)$. We assume that
the ik traffic, the input i traffic and the output k traffic are
$(\sigma_{ik}, r_{ik})$, $(\sigma, 1)$ and $(\sigma, 1-\varepsilon)$ constrained respectively for
some burst parameters $\sigma_{ik}$ and $\sigma$, for some rate parameters
$r_{ik}$ and for some load parameter $\varepsilon$. In other words we assume
that,

$$A_{ik}(s,t) \le \sigma_{ik} + r_{ik}(t-s),$$
$$A_i(s,t) \le \sigma + (t-s),$$
$$A_k(s,t) \le \sigma + (1-\varepsilon)(t-s).$$
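
As an illustration of what a burst/rate constraint means, the following sketch checks whether a finite, time-slotted arrival trace satisfies $A(s,t) \le \sigma + r(t-s)$ for every interval. The function name and the slotted representation of arrivals are assumptions made for this example.

```python
def is_constrained(arrivals, sigma, rate):
    """Check A(s, t) <= sigma + rate * (t - s) for all 0 <= s < t <= T.

    arrivals[u] is the amount of traffic arriving in slot u, so that
    A(s, t) is the sum of arrivals[s:t].
    """
    prefix = [0.0]
    for a in arrivals:
        prefix.append(prefix[-1] + a)
    horizon = len(arrivals)
    tol = 1e-12  # tolerance for floating-point rounding
    return all(
        prefix[t] - prefix[s] <= sigma + rate * (t - s) + tol
        for s in range(horizon)
        for t in range(s + 1, horizon + 1)
    )
```

For example, the trace `[1, 1, 0, 0]` is (1, 0.5)-constrained but not (0.5, 0.5)-constrained, because two packets arrive in the first two slots.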

We remark that the arrival rates r ik will typically vary 
over time. Indeed, the energy savings that we hope to gain 
come precisely from the fact that we can match the number 
of active components to the traffic. However, we assume that 
this happens over a slow timescale and so we perform our 
scheduling analysis as if the rates $r_{ik}$ are fixed.

¹ Note that round robin is another possibility for choosing a middle-stage
node. However, a random choice is more robust against adversarial types of 
traffic arrivals. We do not go into details here. 



D. Results 

• In Section II we make the simple statement that $\lceil \hat{r} m \rceil$
active middle-stage nodes suffice for handling the traffic
load where $\hat{r} \ge \sum_i r_{ik}$ for all k and $\hat{r} \ge \sum_k r_{ik}$ for all
i. This in turn implies that the energy required to serve
the traffic over the long-term is $w(\lceil \hat{r} m \rceil)$.

• In Sections III to IV-E we present a probabilistic analysis
that bounds the queue sizes at the input and middle
stages and bounds the delay experienced by packets as
they travel from the router input to the router output. We
first bound the probability that a queue size exceeds
a certain amount q, and that the delay exceeds a certain
amount d, assuming a fixed number of active middle-stage
nodes. An important feature of our bounds is that they
decrease exponentially with q and d. For a fixed traffic
load, we then derive a tradeoff between the queue
size/delay performance and the number of active
middle-stage nodes, which in turn gives an energy-delay and
energy-queue tradeoff.

Leaving energy minimization aside, we believe that this 
is the first detailed analysis of the delay and queue 
performance of a Load-Balanced Router, which may be 
interesting in its own right. 
• In Section V we present some numerical examples to
validate our analytical findings.

We note that our approach is different from the traditional
powerdown and rate adaptation techniques since we will be 
directing traffic in such a way that enables some components 
to be off. In other words for the middle stage nodes we are 
not trying to match service rate to a traffic process that is 
exogenous. We are trying to match the active middle stage 
nodes to a traffic process that is under our control due to our 
ability to route within the router. 

We also remark that our bounds could be used to govern 
how many middle nodes are active in a Load-Balanced 
Router without necessarily computing all the bounds on the 
fly. We could instead precompute the bounds and create a 
simple look-up table that determines how many middle stage 
nodes should be active based on measurements of the load 
at the inputs and outputs. 

II. Throughput Analysis 

Recall that $r_{ik}$ represents the current arrival rate of traffic
that wishes to be routed from input i to output k. Let
$r_i = \sum_k r_{ik}$ and let $r_k = \sum_i r_{ik}$. Let $\hat{r}$ be such that $r_i \le \hat{r}$ and
$r_k \le \hat{r}$ for all i, k.

Lemma 1. If we use $m'$ middle stage elements then the router
is not overloaded if $m' \ge \hat{r} m$. Hence the power required to
serve all the traffic in the long-term is at most $w(\lceil \hat{r} m \rceil)$.

Proof: Follows from the fact that if we turn on $m'$
middle stage elements then for each middle element j,
$1 \le j \le m'$, the traffic rate that will be routed on link
ij will be at most $\hat{r}/m'$, which by assumption is at most
$(m'/m)/m' = 1/m$. Since the capacity on the link ij is $\alpha/m$
for some $\alpha \ge 1$, this implies that the ij link is not overloaded.
A similar argument applies to each link jk between the
middle and output stages. ■
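
Lemma 1 translates directly into a small computation. The sketch below, with hypothetical function names, takes a matrix of rates $r_{ik}$, computes $\hat{r}$ as the largest row or column sum, and returns $\lceil \hat{r} m \rceil$; the small tolerance inside the ceiling is an implementation choice to guard against floating-point rounding.

```python
import math


def load_factor(rates):
    """Return r_hat, the largest row sum or column sum of rates[i][k]."""
    max_row = max(sum(row) for row in rates)
    max_col = max(sum(col) for col in zip(*rates))
    return max(max_row, max_col)


def min_active_nodes(rates, m):
    """Smallest m' with m' >= r_hat * m, as in Lemma 1 (at least one node)."""
    r_hat = load_factor(rates)
    return max(1, math.ceil(r_hat * m - 1e-9))


if __name__ == "__main__":
    rates = [[0.25, 0.25], [0.125, 0.25]]  # r_ik for a 2 x 2 example
    print(load_factor(rates), min_active_nodes(rates, m=6))
```

In the example the busiest input and the busiest output both carry load 0.5, so half of the m = 6 middle nodes suffice.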

III. Overview of Delay Analysis 

Before diving into details of the delay analysis, we first 
provide a high level overview of our techniques. 

A. Relationship with Stochastic Network Calculus 

We use a variant of network calculus [10], [11], [14],
sometimes referred to as stochastic network calculus. The 
original form of network calculus derives delay bounds by 
imposing upper bounds on the amount of traffic arriving at a 
node via arrival curves and lower bounds on the amount of 
traffic served by a node via service curves. By relating these 
two curves we can both obtain a bound on the delay suffered 
by data at a network element and also characterize the arrival 
curves for the data at any downstream nodes. However, in 
the traditional network calculus all such bounds are required 
to hold with probability 1. In our context this will lead to 
extremely weak bounds since there is a non-zero probability 
that the router will send a large number of packets to a single 
middle-stage node, thus condemning them all to extremely 
poor service. 

An alternative therefore is to use a stochastic network 
calculus in which we only wish for bounds on service to hold 
with high probability. A detailed formulation of a stochastic 
network calculus was outlined in a series of papers by Jiang 
and others [19]. At a high level Jiang's approach
obtains curves that bound the probability that the delay
exceeds a certain amount at upstream elements, and then uses
these curves to bound the worst-case arrivals at downstream 
elements. However, we follow a slightly different approach 
since we are able to obtain better bounds by not directly 
utilizing the service curves at the input nodes to bound 
the arrivals at the middle-stage nodes. We instead base our 
calculations at the middle stage on the external arrivals of 
various ensembles of flows that might then be time-shifted 
due to delays at the input. This allows us to avoid handling 
complicated convolutions of arrival and service curves. We 
elaborate further on this distinction later. 

B. Our Approach 

We divide our analysis into a series of pieces. 

a) Bound the queue build-up at the input: Recall that
between input i and middle-stage node j we effectively have
a link with speed $\alpha/m$. Also recall that the input has a buffer
dedicated to the traffic that wishes to go from input
i to middle-stage node j. Suppose that at some time t this
buffer has level q. Let s be the last time that this buffer was
empty. Note that during the time interval [s, t) the ij link
served data of total size $\alpha(t-s)/m$. Therefore the total data
that arrived for link ij during the time interval [s, t) is at least
$q + \alpha(t-s)/m$.

Therefore the probability that link ij has a backlog of size
q at time t is upper bounded by the probability that for some
s < t, the amount of data arriving for link ij during the
interval [s, t) is at least $q + \alpha(t-s)/m$. However, recall
that the traffic arriving to input i arrives at rate at most $\hat{r}$ and
each packet is sent to a middle-stage node chosen uniformly
at random. Hence, for fixed s and t, we can use a Chernoff
bound to bound the probability that the amount of data
arriving for link ij during the interval [s, t) is at least
$q + \alpha(t-s)/m$. It is easy to show that this probability decreases
exponentially in t − s and so we can use a union bound to bound
the probability that this occurs for any s < t.

b) Bound the delay experienced at the input: Translat- 
ing our bound on queue size into a bound on delay is simple. 
Since the transmission rate on the ij link is $\alpha/m$, the event
that the head-of-line packet for the ij link at time t has
experienced delay d implies the event that at time t − d
the queue size for the ij link was at least $d\alpha/m$. An upper
bound on the probability of the latter event can be derived 
using the method described earlier. 

c) Bound the queue build-up at middle stage: We now 
give an overview of how we perform the analysis at the head 
of the jk queue at middle element j. This forms the crux 
of our analysis since we are in a more complicated situation 
due to the fact that the packet arrivals at the middle stage 
are affected by how they are served at the input. As before 
note that if the queue for link jk has size q at time t, then
for some time s < t, the arrivals for link jk at middle-stage
node j during the time interval [s, t) must be at least
$q + \beta(t-s)/m$. Now suppose that the oldest of this data
arrived at the input at time s − d, and suppose in addition
that this data arrived at middle-stage node j on link ij. We
can therefore state that the total amount of data arriving at
the system in the time interval [s − d, t) that is destined for
link jk is at least $q + \beta(t-s)/m$ and the delay experienced
by data arriving at link ij at time s − d is at least d. Via
union bounds we can calculate the probability that this occurs
for any i, s, d. In particular, for d small we can say that the
probability that the arrivals exceed $q + \beta(t-s)/m$ is small
whereas if d is large then the probability that the link ij
delay is at least d is small.

We make three points about this analysis at the middle 
stage. 

• This is where our analysis deviates slightly from the 
traditional methodology of network calculus. We do not 
calculate a service curve for the input and use that 
service curve to bound the arrivals at the middle stage. 
Instead we analyze the delay behavior at the input and 
then use that calculation to bound the middle stage 
queue size using an expression that still involves arrivals 
at the input. 

• Our initial analysis makes a slight approximating
assumption in that for the ijk defined above, we treat the
arriving traffic for link jk and the delay behavior on
link ij as independent. In reality of course they will
be correlated due to traffic that passes from input i to
output k through middle-stage node j. However, this
approximation will typically not have a large effect for
larger values of m since the ijk traffic only forms a
1/m fraction of the total jk traffic. However, in order to get
a more accurate bound, in Section IV-E we present a
more detailed expression that deals with the correlation
explicitly.

• Our analysis is based on Chernoff bounds. However, 
the form of the Chernoff bound that we use changes 
depending on whether the bound on data arrivals that 
we are considering for a particular link is close to or 
far from the expected number of data arrivals for that 
link. This unfortunately leads to a somewhat involved 
case analysis. 

d) Bound the delay experienced at the middle stage: 
The conversion of the queue-size bound to a delay bound 
at the middle stage can be done in the exact same way as 
for the input. Now that we have an expression for the delay 
at both stages we can convert it into an expression for the 
end-to-end delay from the inputs to the outputs. 

IV. Analytical Bounds on Delay 

We now present the details of our delay analysis. We rely 
heavily on the following Chernoff bounds [27]. In particular
we use (2) to derive analytical bounds and we use (1) to
derive tighter bounds for numerical simulation.

Theorem 2 (Chernoff Bound). Let $X_1, \ldots, X_n$ be independent
binary random variables, and let $\mu$ be an upper bound
on the expectation $E[\sum_i X_i]$. For all $\delta > 0$,

$$\Pr\Big[\sum_i X_i \ge (1+\delta)\mu\Big] \le \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu} \quad (1)$$
$$\le e^{-\min(\delta^2,\,\delta)\,\mu/3}. \quad (2)$$

In what follows we sometimes refer to $\mu$ as the aggregated
mean and $\delta$ as the excess factor. For conciseness we shall
often use the term aggregated mean even though $\mu$ is strictly
speaking a bound on the mean.
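
To give a feel for how the two forms of the Chernoff bound compare, the sketch below evaluates both and checks them against an exact binomial tail. It assumes the standard forms $(e^{\delta}/(1+\delta)^{1+\delta})^{\mu}$ and $e^{-\min(\delta^2,\delta)\mu/3}$; the function names and the binomial test case are illustrative choices, not part of the analysis.

```python
import math


def chernoff_tight(mu: float, delta: float) -> float:
    """(e^delta / (1 + delta)^(1 + delta))^mu, evaluated in log space."""
    return math.exp(mu * (delta - (1 + delta) * math.log1p(delta)))


def chernoff_simple(mu: float, delta: float) -> float:
    """exp(-min(delta^2, delta) * mu / 3), the looser closed form."""
    return math.exp(-min(delta * delta, delta) * mu / 3)


def binomial_upper_tail(n: int, p: float, threshold: float) -> float:
    """Exact Pr[X >= threshold] for X ~ Binomial(n, p)."""
    k0 = math.ceil(threshold)
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k0, n + 1)
    )


if __name__ == "__main__":
    n, p, delta = 100, 0.1, 1.0
    mu = n * p
    print(binomial_upper_tail(n, p, (1 + delta) * mu))
    print(chernoff_tight(mu, delta))
    print(chernoff_simple(mu, delta))
```

The exact tail is below the first bound, which is in turn below the second, matching the remark that the first form is tighter while the second is easier to manipulate analytically.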

A. Input stage analysis 

We begin by computing the queue distribution at the head
of link ij, where $i \in [1,n]$ is an input and $j \in [1,m]$ is in
the middle stage. As in Section II we assume that $r_i \le \hat{r}$
and $r_k \le \hat{r}$ for all i, k and we assume that $m'$ middle stage
elements are currently active. Initially, in order to keep the
formulas manageable, we shall derive our formulas for the
case in which $\hat{r} = 1$ and $m' = m$. We shall also assume that
the arriving traffic is smooth and so we do not have the burst
terms $\sigma_i$ and $\sigma_k$. Later on, we shall show how to adapt the
formulas when these assumptions do not hold.

Let $A_{ijk}(t)$ be the binary random variable that indicates
whether a packet with input i and output k is mapped to
middle stage j at time t. We use $A_{ijk}(t_1, t_2)$ to denote
$\sum_{t=t_1+1}^{t_2} A_{ijk}(t)$, the total arrival in the duration $(t_1, t_2]$. Let
$Q^{(1)}_{ij}(t)$ be the random variable for the queue size at the head
of link ij at time t. To compute $Q^{(1)}_{ij}(t)$, let $s < t$
be the last time that the queue at ij is empty. For $Q^{(1)}_{ij}(t)$ to be
larger than a value q, the arrival $A_{ijk}(s,t)$ over all $k \in [1,n]$
must be at least $\alpha(t-s)/m + q$ since link ij is operated at a rate
$\alpha/m$. Formally,

$$\Pr[Q^{(1)}_{ij}(t) \ge q] \le \sum_{s<t} \Pr\Big[\sum_k A_{ijk}(s,t) \ge \frac{\alpha(t-s)}{m} + q\Big]. \quad (3)$$

We now bound the right-hand side of (3). Since $A_{ijk}(t)$ are
independent binary variables, we apply the Chernoff bound to
every term in the summation. The following is the aggregated
mean $\mu$ and the excess factor $\delta$ for the above probability.

$$\mu = \frac{t-s}{m}, \qquad \delta = \alpha - 1 + \frac{qm}{t-s},$$

since $\alpha(t-s)/m + q = \mu(1+\delta)$. If $\alpha \ge 2$, we have $\delta \ge 1$ and
we use (2) to derive,

$$\Pr[Q^{(1)}_{ij}(t) \ge q] \le \sum_{t-s=0}^{\infty} e^{-\frac{(t-s)(\alpha-1)+qm}{3m}} \le \frac{e^{-q/3}}{1 - e^{-\frac{\alpha-1}{3m}}}. \quad (4)$$

If $\alpha < 2$, we have $\delta \ge 1$ when $t - s \le \frac{qm}{2-\alpha}$. Otherwise
$\delta < 1$. We apply (2) as follows.

$$\Pr[Q^{(1)}_{ij}(t) \ge q] \le \sum_{t-s \le \frac{qm}{2-\alpha}} e^{-\delta\mu/3} + \sum_{t-s > \frac{qm}{2-\alpha}} e^{-\delta^2\mu/3}. \quad (5)$$

Note that both expressions are exponentially decreasing in
q. For the time being we shall proceed according to the case
$\alpha \ge 2$ since this leads to more manageable formulas. Later
on, we shall indicate where to adapt the formulas when we
are in a scenario where $\alpha < 2$.

Let $D^{(1)}_{ij}(t)$ be the maximum delay that some packet has
experienced at time t in the queue $Q^{(1)}_{ij}(t)$. For this delay
to be more than d at time t, some packet must be in the
queue at time t − d. Since link ij operates at rate $\alpha/m$ in a
FIFO manner, the queue at time t − d must be at least $d\alpha/m$.
Therefore,

$$\Pr[D^{(1)}_{ij}(t) \ge d] \le \Pr[Q^{(1)}_{ij}(t-d) \ge d\alpha/m] \le \frac{e^{-\frac{d\alpha}{3m}}}{1 - e^{-\frac{\alpha-1}{3m}}}. \quad (6)$$

Recall that the above analysis was performed in the
absence of the burst term $\sigma$. If we do have bursty traffic
then we can adjust the formulas by changing the aggregated
mean and the excess factor accordingly and propagating
these changes through the resulting formulas.

B. Middle stage analysis

We now compute the queue distribution at the head of link
jk, where $j \in [1,m]$ is in the middle stage and $k \in [1,n]$
is an output. Let $Q^{(2)}_{jk}(t)$ be defined similarly as $Q^{(1)}_{ij}(t)$. To
bound $Q^{(2)}_{jk}$ at time t, let $s < t$ be the last time that the
queue $Q^{(2)}_{jk}$ was empty. For $Q^{(2)}_{jk}(t)$ to be larger than q, there
must be at least $\beta(t-s)/m + q$ distinct packets in $Q^{(2)}_{jk}$ during
the time period [s, t]. Further, let s − d be the earliest time
one of these packets arrived at an input, say i. This packet
must experience a delay of at least d in $Q^{(1)}_{ij}$. Therefore,
$\Pr[Q^{(2)}_{jk}(t) \ge q]$ is bounded by a sum, over $s < t$ and d, of the
probability that some input i has $D^{(1)}_{ij}(s) \ge d$ and that
$\sum_i A_{ijk}(s-d,t) \ge \beta(t-s)/m + q$.

Note that the bound (6) on the delay distribution at the input
stage is independent of the time index. We can therefore
move $\Pr[\exists i, D^{(1)}_{ij}(s) \ge d]$ outside the summation
indexed by time in the bound above.

We proceed to bound $\sum_{s<t} \Pr[\sum_i A_{ijk}(s-d,t) \ge \beta(t-s)/m + q]$
based on the following expressions for the
expectation $\mu$ and the excess factor $\delta$.

$$\mu = \frac{t-s+d}{m}, \qquad \delta = \frac{(t-s)(\beta-1) + qm - d}{t-s+d},$$

since $(1+\delta)\mu = \beta(t-s)/m + q$. There are two cases to consider,
$\beta < 2$ or $\beta \ge 2$.

For $\beta < 2$, we further consider the following subcases,
depending on $\frac{qm}{d} - 1$, the value of $\delta$ when $t - s = 0$. Note
that as $t - s$ increases, the value of $\delta$ approaches $\beta - 1$.

• Case 1a: $1 \le \frac{qm}{d} - 1$, and $t - s \le \frac{qm-2d}{2-\beta}$. In this case
$1 \le \delta$. Since $\delta\mu = \frac{(t-s)(\beta-1)+qm-d}{m}$, bound (2) gives a
term of at most $e^{-\delta\mu/3}$.

• Case 1b: $1 \le \frac{qm}{d} - 1$, and $\frac{qm-2d}{2-\beta} < t - s$. In this
case $\beta - 1 < \delta < 1$, which implies $\delta^2\mu > (\beta-1)^2\mu$.
Bound (2) in turn gives a term of at most $e^{-(\beta-1)^2\mu/3}$.

• Case 2: $\beta - 1 \le \frac{qm}{d} - 1 < 1$, for all values of $t - s$.
In this case $\beta - 1 \le \delta < 1$, which is the same situation
as 1b and leads to the same term in the same way.

• Case 3a: $\frac{qm}{d} - 1 < \beta - 1$ and $\frac{d(\beta+1)-2qm}{\beta-1} \le 0$. In this
case $\frac{\beta-1}{2} \le \delta < \beta - 1$ for all values of $t - s$. Since
$\delta^2\mu \ge \frac{(\beta-1)^2\mu}{4}$, bound (2) gives a term of at most
$e^{-(\beta-1)^2\mu/12}$.

• Case 3b: $\frac{qm}{d} - 1 < \beta - 1$, and $0 < \frac{d(\beta+1)-2qm}{\beta-1} \le t - s$.
In this case $\frac{\beta-1}{2} \le \delta < \beta - 1$. Since
$\delta^2\mu \ge \frac{(\beta-1)^2\mu}{4}$, bound (2) gives the same term as in Case 3a.

• Case 3c: $\frac{qm}{d} - 1 < \beta - 1$ and $t - s < \frac{d(\beta+1)-2qm}{\beta-1}$. We
trivially upper bound the probability by 1.

Note that every term above decreases exponentially with the
queue size q.

The case in which $\beta \ge 2$ is simpler. We omit the analysis
for space considerations. Recall that all of the above formulas
are for the case $\alpha \ge 2$. If $\alpha < 2$ then we need to replace all
the factors of the form $\frac{e^{-d\alpha/(3m)}}{1 - e^{-(\alpha-1)/(3m)}}$ obtained from (6)
with the corresponding factors obtained from (5).

If output k has a burst term $\sigma$ then, as in the input stage,
we can reflect this by adjusting the aggregated mean and the
excess factor accordingly.

We conclude this section by bounding the delay distribution
at the middle stage. Let $D^{(2)}_{jk}(t)$ be the maximum delay that
some packet has experienced at time t in the queue $Q^{(2)}_{jk}(t)$.
For this delay to be more than d at time t, some packet must
be in the queue at time t − d. Since link jk operates in a
FIFO manner, the queue at time t − d must be at least $d\beta/m$.
Hence $\Pr[D^{(2)}_{jk}(t) \ge d] \le \Pr[Q^{(2)}_{jk}(t-d) \ge d\beta/m]$.

We stress again that all of the above formulas are for the
case $\alpha \ge 2$. If $\alpha < 2$ then we need to replace all the factors
from (6) with factors from (5).

C. Eventual end-to-end delay

Now that we have delay bounds for the two stages of the
router we can obtain a bound on the end-to-end delay
distribution. In the following we let $g_{m,\alpha,\beta}(q)$ be a shorthand for the
upper bound that we have derived on $\Pr[Q^{(2)}_{jk}(t) \ge q]$ and
$f_{m,\alpha}(q)$ a shorthand for our upper bound on $\Pr[Q^{(1)}_{ij}(t) \ge q]$.
Suppose that an ijk packet is still traversing the router at time
t, but it arrived at the router before time t − d. It is not hard
to see that either it is still waiting to traverse the input stage
at time $t - d/2$, or it arrived at the middle stage by time $t - d/2$.
Hence the probability that the end-to-end delay is at least d
is at most,

$$\Pr[D^{(1)}_{ij}(t - d/2) \ge d/2] + \Pr[D^{(2)}_{jk}(t) \ge d/2] \le f_{m,\alpha}\Big(\frac{d\alpha}{2m}\Big) + g_{m,\alpha,\beta}\Big(\frac{d\beta}{2m}\Big).$$


D. Characterizing the tradeoff with the number of middle elements

In the above delay analysis we made a number of simplifying
assumptions to keep the notation manageable. For
example we held $m' = m$ and $\hat{r} = 1$. We now demonstrate
that we can use the above formulas to handle the case of
arbitrary $m'$ and $\hat{r}$. This in turn allows us to characterize the
tradeoff between energy consumption and end-to-end delay.



The main idea is to scale time so that the arrival rate at the
inputs is scaled to 1. We also adjust the link rates on the
two stages of the mesh. In particular, if we wish to analyze
a system with a given $m$, $m'$, $\hat{r}$, $\alpha$ and $\beta$, we define a new
system characterized by $\tilde{m}$, $\tilde{m}'$, $\tilde{r}$, $\tilde{\alpha}$ and $\tilde{\beta}$ in which we set,

$$\tilde{m} = \tilde{m}' = m', \qquad \tilde{r} = 1, \qquad \tilde{\alpha} = \frac{\alpha m'}{m \hat{r}}, \qquad \tilde{\beta} = \frac{\beta m'}{m \hat{r}}.$$

Then it is not hard to see that if we scale time by a factor $\hat{r}$,
the new system has exactly the same behavior as the old one.
However, we are now working in a system with $\tilde{m}' = \tilde{m}$ and
$\tilde{r} = 1$. Hence we can apply the analysis that we have already
derived.
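
The rescaling step can be illustrated with a short sketch. It assumes that the rescaled system uses $m'$ middle nodes, unit arrival rate, and speedups $\alpha m'/(m\hat{r})$ and $\beta m'/(m\hat{r})$; this choice is an assumption of the example, selected so that the per-link utilization is unchanged. The function names are hypothetical.

```python
def rescale(m, m_active, r_hat, alpha, beta):
    """Map (m, m', r_hat, alpha, beta) to an equivalent system with
    all middle nodes active and unit arrival rate (assumed scaling)."""
    scale = m_active / (m * r_hat)
    return {
        "m": m_active,
        "m_active": m_active,
        "r_hat": 1.0,
        "alpha": alpha * scale,
        "beta": beta * scale,
    }


def link_utilization(m, m_active, r_hat, speedup):
    """Per-link arrival rate r_hat / m' divided by link capacity speedup / m."""
    return (r_hat / m_active) / (speedup / m)


if __name__ == "__main__":
    new = rescale(m=80, m_active=60, r_hat=0.5, alpha=2.0, beta=2.0)
    print(new)
    print(link_utilization(80, 60, 0.5, 2.0))
    print(link_utilization(new["m"], new["m_active"], new["r_hat"], new["alpha"]))
```

The two printed utilizations coincide, which is the sense in which the rescaled system behaves like the original one up to a change of time unit.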

Our main result is thus, 

Theorem 3. If we run the router with $m'$ middle stage
elements then it uses energy $w(m')$ and the probability that
the end-to-end delay is at least d is bounded by the end-to-end
bound of Section IV-C evaluated for the rescaled system.

E. Dealing with dependence

In our analysis of the middle stage we implicitly made the
simplifying assumption that $A_{ijk}(s-d,t)$ is independent
from $D^{(1)}_{ij}(s)$. However, this is not strictly true since the
arrivals for the path ijk will affect both $A_{ijk}(s-d,t)$
as well as $D^{(1)}_{ij}(s)$. Since the ijk flow represents only a 1/m
fraction of the traffic that contributes to $D^{(1)}_{ij}(s)$, this will
typically have a negligible effect on the eventual results.
However, if we required a true upper bound on the probability
distribution of $Q^{(2)}_{jk}$ we must use an adaptation
of the formula which, for each i, conditions the event
$D^{(1)}_{ij}(s) \ge d$ on whether or not $A_{ijk}(s-d,t)$ is greater
than a threshold proportional to $\delta(t-s)\beta$. More formally, the
resulting bound on $\Pr[Q^{(2)}_{jk}(t) \ge q]$ is a sum over d and $s < t$
of terms that involve, for each i, the conditional probability
of $D^{(1)}_{ij}(s) \ge d$ given the value of $A_{ijk}(s-d,t)$ relative to this
threshold.


F. Queues at output 

We conclude this section by explaining how the analysis
can be extended if we also wish to bound the delay on the
output link from the router. Let $Q^{(3)}_k(t)$ be the queue at the
head of the output link and recall that this link has speed 1.
To bound $Q^{(3)}_k(t)$ at time t, let $s < t$ be the last time that
queue $Q^{(3)}_k$ was empty. For $Q^{(3)}_k(t)$ to be larger than q, there
must be at least $t - s + q$ distinct packets in $Q^{(3)}_k$ during the
time period [s, t]. Further, let s − d be the earliest time one
of these packets arrived at an input. Suppose that the path of
this packet is ijk. Then the packet must either be waiting in
the queue $Q^{(1)}_{ij}$ at time $s - d/2$ or it must be waiting in the
queue $Q^{(2)}_{jk}$ at time $s - d/2$. In the former case the packet must
have experienced delay of $d/2$ in $Q^{(1)}_{ij}$ at time $s - d/2$ and in
the latter case it must have experienced delay of $d/2$ in $Q^{(2)}_{jk}$
at time s. Therefore, $\Pr[Q^{(3)}_k(t) \ge q]$ can be bounded in terms
of the delay bounds for the first two stages.

In addition, as before, we can convert this bound to a bound
on delay for the third stage. Let $D^{(3)}_k(t)$ be the maximum delay
that some packet has experienced at time t in the queue
$Q^{(3)}_k(t)$. For this delay to be more than d at time t, some
packet must be in the queue at time t − d. Since link k operates
in a FIFO manner, the queue at time t − d must be at least
$d(1-\varepsilon)$. Hence $\Pr[D^{(3)}_k(t) \ge d] \le \Pr[Q^{(3)}_k(t-d) \ge d(1-\varepsilon)]$.



V. Numerical Results 

In this section we show numerical examples of the queue 
bounds. In particular for these calculations we use formulas 
that are similar to those derived in Section IV but we use
Chernoff bounds of the form (1) rather than (2) since the
former are tighter.

Figure 2 plots the logarithm of the probability of a middle-stage
queue jk exceeding q packets against the queue size
q. For this instance, the router is 20 × 80 × 20. The traffic is
fully loaded for each of the inputs and outputs. We vary the
speedup $\alpha = \beta$ over the values 2, 3, 4 and 5. As we can see
from the figure, the logarithm of the probability decreases
linearly with the queue size, which means the probability
decreases exponentially with the queue size. As expected, we can
also see that with increasing speedup, the probability of the
queue size exceeding q drops.
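
The same qualitative behavior can be seen in the simpler input-stage bound. The sketch below assumes that, for $\alpha \ge 2$, $\hat{r} = 1$ and $m' = m$, the input-stage bound takes the closed form $e^{-q/3} / (1 - e^{-(\alpha-1)/(3m)})$, and evaluates its logarithm. It is not the middle-stage computation behind Figure 2, and the function name is hypothetical.

```python
import math


def log_input_queue_bound(q: float, m: int, alpha: float) -> float:
    """Natural log of e^{-q/3} / (1 - e^{-(alpha-1)/(3m)}) (assumed form)."""
    if alpha < 2:
        raise ValueError("this closed form assumes alpha >= 2")
    # 1 - exp(-x) is computed as -expm1(-x) for accuracy when x is small.
    denom = -math.expm1(-(alpha - 1) / (3 * m))
    return -q / 3 - math.log(denom)


if __name__ == "__main__":
    m = 80
    for alpha in (2, 3, 4, 5):
        row = [round(log_input_queue_bound(q, m, alpha), 2) for q in (20, 40, 60)]
        print(alpha, row)
```

The log of the bound falls by exactly 1 for every 3 additional packets of queue size, and a larger speedup shifts the whole line downward.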

Figure [3] demonstrates the tradeoff between the queue 
size and the number of active middle-stage nodes. For this 
instance, the router is again 20 x 80 x 20. The link rate of 
the interconnect is set to 1/20. The traffic is fully loaded for 
each of the inputs and outputs. If we keep all m = 80 nodes
in the middle stage active, we can see from the bottom curve
of Figure 3 that the queue size is the smallest. However, this
option is also the most energy consuming as all 80 middle-stage
nodes are kept active. For the other extreme, we can
activate 40 middle-stage nodes, which is the most energy
efficient. However, we can see from the top curve of
Figure 3 that the queue size is the largest. The curves in
between correspond to the intermediate cases in which the
number of active middle-stage nodes is 50, 60 and 70.

Fig. 2. Log of probability against queue size, for n = 20, m = 80 and
fully loaded traffic. From top to bottom, the curves correspond to increasing
speedup $\alpha = \beta$ = 2, 3, 4 and 5.

Fig. 3. Log of probability against queue size, for same amount of total traffic
but varying number of active middle-stage nodes. From top to bottom, the
curves correspond to increasing number of active middle-stage nodes from
m = 40, 50, 60, 70, 80. The number of inputs and outputs is n = 20 and
interconnect link rate is 1/20.

VI. Conclusion 

In this paper we revisit the Load-Balanced Router architecture,
motivated by its potential to deliver energy proportionality
for routers. We offer a detailed analysis of the queue
lengths and packet delays under a simple random routing
algorithm which is robust against all admissible traffic. This
allows us to observe a tradeoff between performance measures
such as queue size and energy consumption.

Our paper does not focus on algorithms that optimize the 
number of active middle-stage nodes. We give a very simple 
argument for which the size of the active middle stage is 
proportional to the maximum traffic load over all inputs and 
outputs. It is an intriguing open question to see how one
could make sure that the size of the active middle stage is
proportional to the average traffic over all inputs, not to the
maximum.